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The eigenvalue-function pair of the 3D Schrodinger equation can be efficiently computed by use 
of high order, imaginary time propagators. Due to the diffusion character of the kinetic energy 
operator in imaginary time, algorithms developed so far are at most fourth-order. In this work, we 
show that for a grid based algorithm, imaginary time propagation of any even order can be devised 
on the basis of multi-product splitting. The effectiveness of these algorithms, up to the 12*^ order, 
is demonstrated by computing all 120 eigenstates of a model Ceo molecule to very high precisions. 
The algorithms are particularly useful when implemented on parallel computer architectures. 

I. INTRODUCTION 

With the advance of the density functional method of solving diverse solid state physics and quantum chemistry 
problems^, it is of growing importance to solve the Schrodinger equation on a large 3-D mesh with greater than N = 10^ 
grid points. For such a large mesh, conventional matrix methods are impractical, since even the minimal matrix- vector 
multiplication would be prohibitively slow. Among 0{N) methods, we have previously shown that fourth-order 
imaginary time propagation^ provides an effective means of solving the Kohn-Sham and related equations^. The use 
of all forward time-step fourth-order algorithms in solving the imaginary time Schrodinger equation has since been 
adapted by many research groups^^^^^. 

The lowest n states of the one-body Schrodinger equation 

H^j{r) = Eji>j{r) (1.1) 

with Hamiltonian 

H = -^V^ + V{v) =T^V (1.2) 
2m 

can be obtained in principle by applying the evolution operator (e = — At) 

r(e) = e^(^+^) (1.3) 
repeatedly on the -^-th time step approximation {?/^^-^\r)} to the set of states {?/^j(r), 1 < j < n}, 

</>f ^) ^ T(.)^f (1.4) 

and orthogonalize the states after every step, 

i 

The method is made practical by approximating the exact evolution operator (|1.3p by a general product form, 

M 

T{e) = Y[e''^'^e^^'^. (1.6) 

The simplest second order decomposition, or the split operator method^^, is 

T2(e) = e^'^e'^e^'^ = T(e) + O(e^). (1.7) 

When this operator acts on a state '0j(r), the two operators ei^^ correspond to point-by-point multiplications and e^^ 
can be evaluated by one complete (forward and backward) Fast Fourier Transform (FFT). Both are 0{N) processes. 
For '72(e), |e| has to be small to maintain good accuracy and many iterations are therefore needed to project out the 
lowest n states. To achieve faster convergence, one could in principle iterate higher order algorithms at larger time 
steps. Unfortunately, Sheng^ and Suzuki^ have proved that, beyond second order, no factorization of the form (|1.6p 
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can have all positive coefficients {a^, bi}. This forward time step requirement is essential for imaginary time propaga- 
tion because if any were negative, then the operator e~^^^^^ would be unbounded, resulting in unstable algorithms 
corresponding to unphysical backward diffusion in time. To derive forward, all positive time step fourth-order algo- 
rithms, Suzuki^^ and Chin^^ have shown that a correction to the potential of the form [V, [T, y]] = (Ti^ /'m)\VV\^ ^ 
as ffist used by Takahashi-Imada^^ and later suggested by Suzuki^^, must be included in the decomposition process. 
We have shown^ previously that these forward fourth-order algorithms can achieve similar accuracy at an order-of- 
magnitude larger step sizes than the second-order splitting (jl.7p . More recently Bandrauk, Dehghanian and Lu^^ have 
suggested that, instead of including such a gradient term, one can use complex coefficients {ai^hi} having positive 
real parts. For real time propagation, their complex time-step algorithms are not left-right symmetric and there- 
fore are not time-reversible. For imaginary time propagation, their fourth-order algorithm requires five complete 
complex-to-complex FFTs, whereas our forward algorithm 4A only needs two real-to-complex/complex-to-real FFTs^. 



II. MULTI-PRODUCT EXPANSION 

If the decomposition of T(e) is restricted to a single product as in (|1.6p . then there is no practical means of 
implementing a sixth or higher order forward algorithn>i^. However, if this restriction is relaxed to a sum of products. 



^CfcJ^e^^'^^V'^'^^^ (2.1) 



then the requirement that {a/c,i, ^/c,i} be positive means that each product can only be second order. Since 72(e) 
is second order with positive coefficients, its powers 7^^(e/A:) can form a basis for such a multi-product expansion. 
Recent work^^ shows that such an expansion is indeed possible and takes the form 

ee(T+V) ^ j- ^^jk ^ o(^2„+l) (2.2) 

where the coefficients Ck are given in closed form for any n: 

^=\[ (2-3) 

with {/ci, /c2, . . . kn} = {1,2,... n}. Since the symmetric ^(e) has only odd powers in e, 

T2(e) = exp[e(T + F) + e^Es + e^E^ + • • •] (2.4) 

where Ei are higher order commutators of T and V, the expansion (|2.2p is just a systematic extrapolation which 
successively removes each odd order error. Explicitly, for n = 2 to 5, we have the following order 4 to order 10 
multi-product expansion: 

T4(e) = -iT2(e) + ^T22(|) (2.5) 

r (A- ^ T(f) 64 2 /ex 6561 3 /ex 16384 /ex 390625 5 /ex 

- 8640^'^'' - 945^2 I2J + 4480^2 UJ " UJ + ^257^^^ UJ ' ^^'^^ 

Since each T2 requires one complete EFT, the above series of 2n-order algorithms only requires n{n + l)/2 complete 
FFTs. Thus algorithms of order 4, 6, 8 and 10 only require 3, 6, 10 and 15 complete FFTs. The low order extrapolation 
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(j2.5p has been used previously^''. Here, we have a systematic expansion to any even order. Note that Romberg- type 
extrapolation^^ such as 

Te(e) = -lT,(e) + l^T/(|), (2.9) 

which triples the number of FFTs in going from order 2n to 2n + 2, is not competitive with Eq. (|2.2p 's linear increase 
of only n + 1 additional FFTs. 

Since some coefficients are negative, this requires that the corresponding product, when acting on state 
be subtracted. This is doable for a grid based discretization of the wave function, which is just a point-by-point 
subtraction. 



III. QUANTUM WELL MODEL OF C60 

To demonstrate the workings of this new family of algorithms, we apply them to a model potential with nontrivial 
geometry, that of a 3D Ceo molecule. The effective attraction of the carbon ions is modeled by a potential of the form 

'^W-^ cosMlr-V.I/.) 

where are the locations of the carbon atoms in the Ceo cage. The strength Vb was chosen 1 in units of Ti' /2m and 
the width of the troughs d = 0.05 a.u. This potential accommodates 120 bound states as needed for a Ceo calculation. 
We have also applied the method in 2D to a square grid of 9x9 quantum dots described by the same potential. The 
convergence behavior of the algorithms in both cases are practically identical and need not be discussed separately. 

While the above potential is not a realistic description of a Ceo molecule, it serves to highlight an important and 
realistic aspect of any such computation: In the 3D case, the lowest 120 eigenvalues consist of two groups of 60 almost 
degenerate eigenvalues, one centered around an energy of -5.245 /2m with an average level spacing of 0.0034 n^/2m, 
the other one around -2.992 Ti^ /2m with an average level spacing of 0.0077 Ti^ /2m. Degenerate eigenvalues pose 
a notorious problem for eigenvalue solvers that contain an orthogonalization step. In this implementation of the 
algorithm, we use the subspace orthogonalization method described in Ref. 3, which is an application of the Petrov- 
Galerkin method^. We again find that the method works well in the present case; the convergence rates for the highest 
and the lowest states are the same. 

Figs. [Hand [2] show the convergence of various algorithms for both the lowest and the highest state of the model Ceo 
molecule. To generate the figures, we started with a time step of At = 0.5 and a set of plane wave initial states in 
an appropriate box. We then reduced the time step by a factor of 0.9 each time after convergence has been reached, 
so that the power law behavior can be seen cleanly. In a realistic calculation, it suffices to reduce the time step by 
a factor 0.5. In other words, the 12*^ order algorithm can reach the 10~^^ error level in just two iterations. The 
reduction factor of 0.5 used here is empirical. Recently, Lehtovaara, Toivanen and Eloranta^ have suggested that the 
time step size can be optimally adjusted with added efforts. 

While the convergence rate of the eigenvalues verified the order of the algorithms, in practice, it is also useful to 
monitor the variance of all states with respect to the evolution operator, 

= E l^"WV'i(rfc) - e^^-^,(rfc)f (3.2) 

k 

Only states with > 7, where 7 is a prescribed error bound, need to be propagated and orthogonalized. As soon 
as all states have converged at a certain time step e, their variances with respect to the Hamiltonian, 

= E - E^^Ji^k)\' , (3.3) 

k 

are calculated. If Rj^ < 7 for all states j, the iterations are terminated, otherwise the time step is reduced and the 
whole process is repeated, taking the result of the previous iteration as initial values. 

IV. PARALLELIZATION 

The advantage of high-order propagation methods is particularly compelling on parallel computer architectures: 
The propagation step (|1.4p can be parallelized efficiently without having to abandon the advantage of using FFTs by 
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simply distributing the states tpj across different processors. In such an arrangement, however, the parahehzation of 
the orthogonahzation step is notoriously difficult. Let Tpro be the propagation time, i.e., the time it takes to carry 
out step (|1.4p for all states, and Tort the orthogonahzation time. Then, the time Ttot for one iteration step on an ideal 
machine with N processors is 

Ttot(A^)= W^ + ^ort, (4.1) 

and the speed up ratio for the "propagation only" and the total time step including orthogonahzation for the j^^ 
order algorithm is 

C^U) (^r^ _ ^pro/tot (1) . . 

assuming the number of states is larger than the number of processors allocated for the task. The actual speed-up 
ratio will be less than this ideal since we have neglected communication overhead and other hardware/system specific 
issues. 

Fig. [3] shows the speed up ratio for the Ceo model calculation in the case of the 2"^, 6*^, and 12*^ order algorithms 
on a 256 Itaniun>i^ processor Altix— machine for up to twelve threads. We show the two speed-up ratios Spro{N) 

and S^{l{N). No particular effort was made to parallelize the orthogonahzation step. Evidently, the speed-up of the 
propagation step alone is a reasonably linear function of the number of threads. The performance improves significantly 
with the order of the algorithm because the increase computational effort for propagation can be distributed while 
the cost of communication remained the same. 

The 12*^ order algorithm can reach about 80 percent of the optimal performance. The actual speed-up is limited 
by the orthogonahzation step; while the 12*^ order algorithm can still attain a more than five-fold speed-up, it is 
hardly worth parallelizing the second order algorithm. The specific speed-up factor for higher order algorithms also 
depends on the number n of needed eigenstates . In general, the time for propagation is essentially proportional to 
n, whereas the time for orthogonahzation goes as n^. 

Thus, high order algorithms provide two advantages: they have faster convergence at larger time steps and are 
more adaptive to parallel computing environments. In the single processor mode, we have determined that the 6*^ 
order algorithm performed the best. 



V. CONCLUSIONS 



The impressive convergence of our high-order algorithms has a computing cost: One propagation step of the 2n-th 
order algorithm is equivalent to n(n + l)/2 propagation steps of the second order algorithm. This cost is compensated 
by two effects: The first is shown in the figures: The much faster convergence as a function of time step implies that 
fewer iterations are needed to complete the calculation. The second advantage is less obvious; since orthogonahzation 
is carried out after the propagation step (|2.2p , the relatively costly number of orthogonahzation steps is dramatically 
reduced. 

The most likely use of the high-order algorithms will be in real-space implementations of density-functional theory. 
For realistic systems, one must include non-local pseudo-potentials. We have recently^ ^ implemented algorithm 4 A 
using pseudo-potentials of the Kleinman-Bylander form^^. Calculating the double commutator [V, [T, V]] for such non- 
local potentials is possible, but the computational cost is twice that of propagating just the potential. Moreover, if the 
electron density in the vicinity of the ion cores deviates from spherical symmetry, then some approximate treatments 
may degrade the order of the algorithm. Thus, our new algorithms, without needing the double commutator, should 
be even more effective for realistic density- functional calculations. 
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Time step At 



FIG. 1: (color online) The convergence of the 2^^ to 12*^ order algorithms for the lowest eigenstate of a model "Ceo" molecule 
are as shown by markers defined in the inset. The dashed lines, as a guide to the eye, are the power laws At^ forn = 2, 4, . . . , 12. 
Also shown is the convergence curve for algorithm 4A of Ref. l3- Its characteristic deviation from the At^ power law behavior 
at small At is due to discretization errors in evaluating the double commutator [V, [T, V]]. 



8 




2 4 6 8 10 12 
Number of threads N 

FIG. 3: (color online) The total speed-up time factor 5'tot (solid lines) and the propagation only (without orthogonalization) 
time speed-up factor 5'pro, as a function of parallel threads, for the 2^^, 6^^, and 12*^ order algorithm (filled squares, circles, 
and triangles, respectively). Also shown is the "ideal" speed-up factor (dotted line). 



